In [None]:
# Install required packages
!pip install --upgrade --quiet natural_pdf

print('✓ Packages installed!')

**Slides:** [slides.pdf](./slides.pdf)

# Installing Natural PDF

There are a LOT of possible extras (many of them AI-flavored) in Natural PDF, but we'll start by just installing the basics. You can install `"natural_pdf[all]"` instead if you want *everything*.

# Opening a PDF

**We'll start by opening a PDF.**

You can use a PDF on your own computer, or you can use one from a URL. I'll start by using one from a URL to make everything a bit easier.

In [None]:
from natural_pdf import PDF

pdf = PDF("https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/01-practice.pdf")
pdf

You can find the pages of the PDF under `pdf.pages`. Let's grab the first one.

In [None]:
page = pdf.pages[0]
page

Pretty boring so far, eh? Let's take a look at the page itself.

In [None]:
page.show()

Incredible!!! Congratulations, you've opened your first PDF with Natural PDF.

# Grabbing page text

Most of the time when you're working with PDFs, you're interested in the text on the page.

`layout=True` is a useful addition if you want to see a text-only representation of the page, and sometimes it helps with data extraction.
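Here's a minimal sketch of grabbing the text both ways. `page_text` is a hypothetical helper (not part of Natural PDF) that just wraps `extract_text()`, and it assumes `page` is the page we opened above.

```python
def page_text(page, layout=False):
    """Return the text of a page, optionally keeping its visual layout.

    `page_text` is a hypothetical helper name, not a Natural PDF function.
    `page` is assumed to be a Natural PDF page such as pdf.pages[0].
    """
    if layout:
        # layout=True asks for a text-only picture of the page,
        # with spacing that roughly mirrors where things sit.
        return page.extract_text(layout=True)
    # Plain extraction: just the text, in reading order.
    return page.extract_text()
```

Try `print(page_text(page))` first, then `print(page_text(page, layout=True))`, and compare the two.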

# Selecting elements and grabbing specific text

You rarely want all of the text, though. How would you describe the **INS-UP70N51NCL41R** text?

- It's in a box
- It's the second piece of text on the page
- It's red
- It starts with "INS"

## Selecting objects: "It's in the box"
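One way to say "it's in the box" is to find the box first, then ask for the text inside it. This is a sketch that assumes the box is drawn as a `rect` element and that it's the first one on the page; `text_in_box` is a made-up helper name.

```python
def text_in_box(page, selector="rect"):
    """Return the text inside the first element matching `selector`.

    Hypothetical helper. Assumes the box on the page is a `rect`
    and that `page.find` gives back None when nothing matches.
    """
    box = page.find(selector)
    if box is None:
        # No box found, so there is nothing to read.
        return None
    return box.extract_text()
```

Calling `text_in_box(page)` should give you whatever is inside that first rectangle. If you want to check you grabbed the right box, `page.find("rect").show()` will highlight it.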

### Selecting multiple objects: "It's the second piece of text"
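`find` gives you one match, `find_all` gives you all of them. A sketch of "the second piece of text", assuming the text elements come back in reading order (`nth_text` is a hypothetical helper):

```python
def nth_text(page, index):
    """Return the text of the `index`-th text element (0-based).

    Hypothetical helper. Assumes page.find_all("text") returns the
    text elements in reading order and supports [index] lookups.
    """
    texts = page.find_all("text")
    return texts[index].extract_text()
```

Python counts from zero, so the second piece of text is `nth_text(page, 1)`.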

### Finding by attributes: "It's the red text"
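Selectors can also filter on attributes in square brackets, a bit like CSS. The sketch below assumes the `text[color~=red]` form (an approximate color match) works in your version of Natural PDF; `text_with_color` is a hypothetical helper.

```python
def text_with_color(page, color="red"):
    """Return the text of every text element whose color is close to `color`.

    Hypothetical helper. The `~=` operator is assumed to mean
    "approximately this color" in Natural PDF's selector syntax.
    """
    selector = f"text[color~={color}]"
    return [element.extract_text() for element in page.find_all(selector)]
```

`text_with_color(page)` returns a list, since more than one thing on the page might be red.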

## Searching by text: "It starts with INS-"
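`:contains(...)` matches text anywhere inside an element, which is a little looser than "starts with". A sketch that uses `:contains` to narrow things down and then checks the prefix in plain Python (`find_text_starting_with` is a hypothetical helper):

```python
def find_text_starting_with(page, prefix):
    """Return the first text element whose text begins with `prefix`.

    Hypothetical helper. `:contains` finds the candidates; the
    startswith check is ordinary Python, not a Natural PDF feature.
    """
    candidates = page.find_all(f'text:contains("{prefix}")')
    for element in candidates:
        # Strip stray whitespace before testing the prefix.
        if element.extract_text().strip().startswith(prefix):
            return element
    return None
```

For our page that would be `find_text_starting_with(page, "INS-")`. You get the element back, so you can call `.extract_text()` or `.show()` on it.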

What about "Chicago, Ill."? It's grey, so...

# Learning about the page

How do we know what's on the page? `page.describe()` can help!

In [None]:
page.describe()

In [None]:
page.find_all('text').inspect()

Let's find the **largest text** that's also Helvetica.
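The `fontname` and `size` columns from `inspect()` are what we need here. This sketch does the filtering in plain Python rather than in the selector, and assumes each text element exposes `.fontname` and `.size` attributes; `largest_text_in_font` is a hypothetical helper.

```python
def largest_text_in_font(page, font_fragment="Helvetica"):
    """Return the biggest text element whose font name contains `font_fragment`.

    Hypothetical helper. Assumes text elements have `.fontname` and
    `.size` attributes, matching the columns shown by inspect().
    """
    candidates = [
        element
        for element in page.find_all("text")
        # Font names are often prefixed or suffixed (e.g. "-Bold"),
        # so look for the fragment anywhere in the name.
        if font_fragment in (getattr(element, "fontname", "") or "")
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda element: element.size)
```

`largest_text_in_font(page)` hands back the element itself. Natural PDF's selectors may offer a shorter route to the same answer, but doing it by hand makes it clear what's being compared.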

## Spatial navigation

What else is on the page that we can extract? How about the **date?** We want to find **Date:** and grab everything to the right of it.
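"Find the label, then look to the right of it" can be sketched like this. It assumes `.right()` accepts `height="element"` to keep the region as tall as the label rather than the whole page; `value_right_of` is a hypothetical helper.

```python
def value_right_of(page, label):
    """Return the text to the right of the element containing `label`.

    Hypothetical helper. `height="element"` is assumed to limit the
    region to the same vertical band as the label itself.
    """
    anchor = page.find(f'text:contains("{label}")')
    if anchor is None:
        # The label isn't on this page.
        return None
    return anchor.right(height="element").extract_text().strip()
```

For the date, that's `value_right_of(page, "Date:")`.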

And the **site?** We want to grab 'site', then keep going right until we see a piece of text.
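Directional methods take an `until` selector, so "keep going right until we see a piece of text" translates fairly directly. A sketch, assuming the endpoint is included in the region by default and that `height="element"` is accepted (`next_text_right_of` is a hypothetical helper):

```python
def next_text_right_of(page, label):
    """Return the text from `label` rightward up to the next text element.

    Hypothetical helper. Assumes `.right(until=...)` stops at the first
    match and includes that match in the region by default.
    """
    anchor = page.find(f'text:contains("{label}")')
    if anchor is None:
        return None
    region = anchor.right(until="text", height="element")
    return region.extract_text().strip()
```

Use the label exactly as it appears on the page, capitalization included, since `:contains` may be case-sensitive.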

How about **Violation Count?**

The **Summary** is a little bit more difficult. How would you describe where it is?
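One way to describe it: "everything below the Summary label, until something else starts." The sketch below leaves that "something else" up to you as a selector; `text_below` is a hypothetical helper.

```python
def text_below(page, label, until):
    """Return the text below `label`, stopping before the `until` selector.

    Hypothetical helper. `until` is any selector marking where the
    section ends; include_endpoint=False leaves that marker out.
    """
    anchor = page.find(f'text:contains("{label}")')
    if anchor is None:
        return None
    region = anchor.below(until=until, include_endpoint=False)
    return region.extract_text().strip()
```

A first guess for this page might be `text_below(page, "Summary", until="line")`, but that's an assumption about the layout. Use `page.describe()` to see what could work as an endpoint.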

## Grabbing tables

Everyone loves extracting tables from PDFs! You can do that here: just do `page.extract_table()`. Easy!!!

In [None]:
table = page.extract_table()
table

In [None]:
table.to_df()

What about a page with **multiple tables?**

In most PDF processing libraries you just say, "give me all of the tables!" and then figure out which one you want. In Natural PDF, the _proper_ way to do it is to find the area you know the table is in and extract it alone.

In [None]:
# Start from the big, bold text that says "Violations" and head down to the smallest text
(
    page.find('text[size=max()]:bold:contains("Violations")').below(
        until='text[size=min()]',
        include_endpoint=False
    )
    .trim()
).show(crop=True)

In [None]:
# Same region as above, but this time extract the table and turn it into a DataFrame
(
    page.find('text[size=max()]:bold:contains("Violations")').below(
        until='text[size=min()]',
        include_endpoint=False
    )
    .trim()
).extract_table().to_df()

# Ignoring text with exclusion zones

What if we have like two hundred of these forms, and they all look the same, and all we want is the top, text-y part?

Instead of writing code about what we *want*, we can also write code about what we *don't want*. These are called [**exclusion zones**](https://jsoma.github.io/natural-pdf/tutorials/05-excluding-content/).

In [None]:
from natural_pdf import PDF

pdf = PDF("https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/01-practice.pdf")
page = pdf.pages[0]

In [None]:
text = page.extract_text()
print(text)

In [None]:
top = page.region(top=0, left=0, height=80)
bottom = page.find("line[width>=2]").below()
(top + bottom).show()

In [None]:
page.add_exclusion(top)
page.add_exclusion(bottom)

page.show(exclusions='red')

In [None]:
text = page.extract_text()
print(text)

Any time there is recurring text (headers, footers, even *stamps on the page you want to ignore*), you can just add it as an exclusion.

It's also possible to add exclusions across *multiple pages*. In the example below, the PDF-level exclusion is applied to every page as you load it. Write it once, be done with it forever!

In [None]:
pdf.add_exclusion(lambda page: page.region(top=0, left=0, height=80))
pdf.add_exclusion(lambda page: page.find("line[width>=2]").below())

## Next steps

What about **when the text isn't so easy to access?** Time to move on to our next notebook!